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We propose a generic Bayesian framework for inference in distri¬ 
butional regression models in which each parameter of a potentially 
complex response distribution and not only the mean is related to a 
structured additive predictor. The latter is composed additively of a 
variety of different functional effect types such as nonlinear effects, 
spatial effects, random coefficients, interaction surfaces or other (pos¬ 
sibly nonstandard) basis function representations. To enforce specific 
properties of the functional effects such as smoothness, informative 
multivariate Gaussian priors are assigned to the basis function co¬ 
efficients. Inference can then be based on computationally efficient 
Markov chain Monte Carlo simulation techniques where a generic 
procedure makes use of distribution-specific iteratively weighted least 
squares approximations to the full conditionals. The framework of 
distributional regression encompasses many special cases relevant for 
treating nonstandard response structures such as highly skewed non¬ 
negative responses, overdispersed and zero-inflated counts or shares 
Including the possibility for zero- and one-inflation. We discuss distri¬ 
butional regression along a study on determinants of labour incomes 
for full-time working males in Germany with a particular focus on 
regional differences after the German reunification. Controlling for 
age, education, work experience and local disparities, we estimate 
full conditional income distributions allowing us to study various dis¬ 
tributional quantities such as moments, quantiles or inequality mea¬ 
sures in a consistent manner in one joint model. Detailed guidance 
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on practical aspects of model choice including the selection of several 
competing distributions for labour incomes and the consideration of 
different covariate effects on the income distribution complete the dis¬ 
tributional regression analysis. We find that next to a lower expected 
income, full-time working men in East Germany also face a more 
unequal income distribution than men in the West, ceteris paribus. 

1. Introduction. The analysis of determinants of labour incomes has a 
long tradition in economics, dating back at least to Mincer (1974). His clas¬ 
sical wage equation includes potential labour market experience as well as 
years of education as the most important determinants of human capital 
which then translates into expected income [Lemieux (2006)]. Additional 
possible determinants include age, actually realised labour market experi¬ 
ence, gender, regional information concerning the residence of employees, 
or area of employment. One considerable restriction of most analyses con¬ 
ducted so far is their sole focus on the expected income given covariates, 
that is, the conditional mean. In some cases, distributions are required, for 
example, for inequality decomposition or to account for incomplete infor¬ 
mation due to truncation or censoring. Then, the (log-)normal distribution 
[Greene (2008), Chapter 19, Morduch and Sicular (2002)] is often implicitly 
considered (again with regression effects only on the mean) or one reverts 
to local analyses by means of quantile regression [Autor, Katz and Kearney 
(2008), Galvao, Lamarche and Lima (2013)]. More flexible types of distri¬ 
butions have so far mostly been used to describe income distributions on a 
highly aggregated level, normally the national level [Kleiber (1996)]. 

We utilise detailed, longitudinal information on incomes available from 
the German socio-economic panel (SOEP) to derive a flexible, structured 
additive distributional regression model for labour incomes of full-time male 
workers. We consider several candidate distributions for describing the non¬ 
negative conditional income distributions, including the log-normal distri¬ 
bution, the gamma distribution, the inverse Gaussian distribution and the 
Dagum distribution. To obtain flexible models, we allow for regression effects 
on potentially all parameters of the income distribution, thereby overcoming 
the previous concentration on expected incomes. As an illustration, consider 
the income distributions visualised in Figure 1 corresponding to an “aver¬ 
age,” full-time male worker with/without higher education in East and West 
Germany. Here we find that the income distributions differ considerably not 
only in terms of their expectation but also with respect to other aspects 
of the distribution, like the variance (see Section 4 for more details on the 
analysis). 

Some earlier attempts to define distributional regression models comprise 
Biewen and Jenkins (2005) or Donald, Green and Paarsch (2000). Biewen 
and Jenkins (2005) suggest to decompose the population into a coarse set 
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Fig. 1. SOEP data. Four conditional income distributions for 42-year-old males with 19 
years of working experience without higher education (left) or with higher education (right) 
and living in the East (dashed lines) or West (solid lines). Densities shown are posterior 
means of the densities in our best model DA.Ml; compare Section 3 for details on the 
model specification. 


of subgroups for which parametric income distributions are estimated such 
that the distributional form varies over the subgroups. Donald, Green and 
Paarsch (2000) propose to vary location and scale parameters with respect 
to covariates while the general shape of the distribution remains fixed over 
the covariate set. Building on their work, we propose to combine these ap¬ 
proaches in the sense that conditional income distributions are modelled 
parametrically as suggested by Biewen and Jenkins (2005), while allowing 
for variation in the whole distribution (not just location and scale) with 
respect to covariates as specified by Donald, Green and Paarsch (2000). 

Differences between East and West Germany have received considerable 
attention in the economic literature [Biewen (2000), Fuchs-Schiindeln, 
Krueger and Sommer (2010), Kohn and Antonczyk (2011)] and also con¬ 
sistently played a major role in the domestic political debate. Instead of 
solely taking a macroeconomic perspective to look at income inequality in 
the East and West at a highly aggregated level, we build a microeconomic 
foundation to the analysis of income inequality. Thereby, we consider the 
effect of various covariates on the conditional individual income distribution 
underlying the aggregate income distribution. It is our hypothesis that there 
are not only significant differences between East and West in the conditional 
mean income but also in the conditional income inequality aggravating the 
economic divide more than two decades after the reunification. 
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As a conceptual framework for our analyses, we extend the Bayesian struc¬ 
tured additive distributional regression models recently proposed in Klein, 
Kneib and Lang (2015) for zero-inflated and overdispersed count data regres¬ 
sion to general types of univariate distributions. In this class of regression 
models, all parameters of a potentially complex response distribution are 
related to additive regression predictors in the spirit of generalised additive 
models (GAMs). While the latter assume responses to follow a distribu¬ 
tion from the exponential family and focus exclusively on relating the mean 
of a response variable to covariates [see, e.g., Ruppert, Wand and Carroll 
(2003), Fahrmeir, Kneib and Lang (2004), Fahrmeir et ah (2013), Wood 
(2004, 2008)], distributional regression enables the consideration of basically 
any response distribution and allows to specify regression predictors for all 
parameters of this distribution. The main advantage of distributional regres¬ 
sion is that it provides a broad and generic framework for regression models 
encompassing continuous, discrete and mixed discrete-continuous response 
distributions and therefore considerably expands the common exponential 
family framework. 

Distributional regression is closely related to generalised additive mod¬ 
els for location, scale and shape (GAMLSS) as suggested by Rigby and 
Stasinopoulos (2005). We prefer the notion of distributional regression for 
our approach since in most cases, the parameters of the response distribution 
are in fact not directly related to location, scale and shape but are general 
parameters of the response distribution and only indirectly determine loca¬ 
tion, scale and shape. For example, in case of the Dagum distribution, there 
are three distributional parameters, but none of them is directly related to 
a measure of location which is jointly determined by all three parameters. 

In GAMLSS, inference is commonly based on penalised maximum likeli¬ 
hood estimation achieved via backfitting loops over the additive predictor 
components. In this paper, we consider a generic Bayesian treatment of 
distributional regression relying on Markov chain Monte Garlo simulation 
algorithms. To construct suitable proposal densities, we follow the idea of 
iteratively weighted least squares proposals [Gamerman (1997), Brezger and 
Lang (2006)1 and use local quadratic approximations to the full conditionals 
in order to avoid manual tuning. Utilising explicit derivations of the score 
function and expected Fisher information in these approximations consid¬ 
erably enhances numerical stability as compared to using numerical deriva¬ 
tives and the observed Fisher information (which are frequently used in 
the R add-on package gamlss implementing penalised likelihood inference). 
The Bayesian approach also has the advantage to provide credible intervals 
without relying on asymptotic arguments. The full potential of distributional 
regression is only exploited when the regression predictor is broadened be¬ 
yond the scope of simple linear or additive specifications. We will consider 
structured additive predictors [Brezger and Lang (2006), Fahrmeir et al. 
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(2013)] where each predictor is determined as an additive combination of 
various types of functional effects, including nonlinear effects of continuous 
covariates, spatial effects, random effects or varying coefficient terms. 

Alternatives to distributional regression are provided by quantile and ex- 
pectile regression which also allow us to go beyond studying the mean by 
focusing on local features of the response distribution, indexed by a pre¬ 
specified asymmetry parameter (the quantile or expectile level); see Koenker 
and Bassett (1978), Newey and Powell (1987) for the original references and 
Koenker (2005), Yu and Moyeed (2001), Schnabel and Eilers (2009), Sobotka 
and Kneib (2012) for more recent overviews. Single quantiles or expectiles 
are elicitable [Osband and Reichelstein (1985), Gneiting (2011a)] by consid¬ 
ering asymmetrically weighted loss functions and consistent estimates can 
be obtained under rather mild conditions on the conditional distribution of 
the responses (basically reducing to independence and the correct specifi¬ 
cation of the quantity of interest). However, when interest focuses on the 
complete conditional distribution or if distributional quantities such as the 
Gini coefficient for inequality that are not elicitable by specifying a corre¬ 
sponding loss function are desired, the direct specification of distributional 
regression turns out to be advantageous. 

Many of the aspects discussed in the remainder of this paper (such as 
choice of a suitable response distribution and adequate predictor specifica¬ 
tions, Bayesian inference, interpretation of estimation results) are relevant 
beyond our application. We therefore provide an analysis on the propor¬ 
tion of farm outputs achieved by cereals in the application Supplement A 
[Klein et al. (2015b), Section A.2], to this paper as a second example on 
distributional regression. 

The remainder of the paper is structured as follows: Section 2 provides 
a detailed introduction to distributional regression and Bayesian inference 
along our case study on labour incomes. Model choice concerning the type of 
the response distribution and the specification of the regression predictors is 
treated in Section 3. Given the selected models. Section 4 provides empirical 
results on the regional disparities of conditional incomes in East and West 
Germany. Additional material on the application is provided in the applica¬ 
tion supplement Section A.l. Section 5 provides a summary and comments 
on directions for future research. Einally, we summarise general aspects of 
distributional regression with other types of responses in the methodological 
Supplement B [Klein et al. (2015c)] which also comprises details on Bayesian 
inference, derivations of required quantities for the iteratively weighted least 
squares proposals and simulation studies. 

2. Distributional regression. As a conceptual framework for our anal¬ 
ysis of labour incomes and their regional disparities, we consider distribu¬ 
tional regression models where, conditional on all available covariate in¬ 
formation summarised in the vector I'i , the response variables yi,... ,yn 
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are assumed to be independently distributed with ii'-parametric densities 
PiUiWii ,..., 'dix) = Pi - The conditional distribution pj of observation yi given 
Vi is indexed by the (in general covariate-dependent) distributional parame¬ 
ters dii,..., 'diK- Each parameter k = 1,... ,K is then related to a semi- 
parametric, additive predictor defined in terms of (potentially different) 
subvectors of the covariate vector Vi. Similarly, as in generalised linear mod¬ 
els, a suitable (one-to-one) response function is utilised to map the predictor 
to the parameter of interest, that is, 'dik = where the superscript 

in the predictors and response functions indicates that we are dealing 
with K predictors specific to the different distributional parameters instead 
of only one single predictor as in mean regression. The response function 
is chosen to ensure appropriate restrictions on the parameter space such 
as the exponential function -dik = exp(7/f'') to ensure positivity. We discuss 
specihc choices for distributional regression of labour incomes after having 
introduced our data in more detail. 

2.1. German labour income data. For studying conditional income distri¬ 
butions in Germany, we utilise information from the German Socio-Economic 
Panel [Wagner, Frick and Schupp (2007)]. More specifically, we consider real 
gross annual personal labour income in Germany as defined in Bach, Gorneo 
and Steiner (2009) for the years 2001 to 2010. We deflate the incomes by the 
consumer price index [Statistisches Bundesamt (2012)], setting 2010 as our 
base year. Thus, all incomes are expressed in real-valued 2010 Euros from 
here on. 

Following the standard literature, we only look at the income of males in 
full-time employment [see, among others, Dustmann, Ludsteck and Schonberg 
(2009), Gard, Heining and Kline (2013)] in the age range 20-60. This yielded 
7216 individuals for whom we considered the income trajectories from the 
ten year period. For each individual, we used every observation for which 
all required dependent and independent variables were available, yielding a 
total of n = 40,965 observations. Naturally, this implies that for some indi¬ 
viduals we do not have full longitudinal coverage over the whole ten year 
period. 

As covariates, we consider educational level measured as a binary indi¬ 
cator for completed higher education (according to the UNESGO Interna¬ 
tional Standard Glassification of Education 1997 provided in the SOEP) in 
effect coding (educ), age in years {age), previous labour market experience 
in years (Imexp), the calendar time (t), information on the geographical 
district {Raumordnungsregion) representing the area of residence (s) and a 
binary indicator in effect coding for districts belonging to the eastern part 
of Germany {east). A description of the data set is given in Table Al; de¬ 
tails on the specifications for the different effect types will be provided in 
Section 2.3. 
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A common assumption in economic analyses of income is that incomes yi 
are log-normally distributed with covariate-dependent location parameter ry 
(corresponding to the mean of the log-transformed incomes) and a constant 
scale parameter . For an observation i collected at time point U, a suitable 
semiparametric predictor (dropping the dependence on the parameter 
could then be specified as 

Vi = 1^0 + educiPi -b fiiagci) + educif2{agei) 

( 1 ) 

“b “1“ /spat('Si) “b /time(^i)) 

where /3o represents the overall intercept, j3i captures the effect of higher 
education, fi{age) and f 2 {age) are nonlinear effects of age capturing also 
the interaction with the educational status, f^{lmexp) is the nonlinear effect 
of previous labour market experience, /spat('S) is a spatial effect capturing 
heterogeneity at the level of the districts s, and /time(i) is an effect specihc 
for the calendar year t. In a second step, the spatial effect can further be 
decomposed into 

(2) /spat(s) = easf^Ti -b 5str(s) + 9unstr('S), 

where 71 captures the difference between the eastern and western part of 
Germany and S'str('S) and q nn st.rfs) represent spatially structured (smooth) 
or unstructured (unsmooth) district-specific effects. Note that, in addition 
to the East-West indicator east, more district-specific information could 
be included if desired. While this decomposition could simply be plugged 
into ( 1 ) to obtain a reduced-form specification, it can also be interpreted 
as a hierarchical multilevel specification where we differentiate between an 
individual-specihc level in (1) and a region-specihc level in (2). Further de¬ 
tails on the predictors and associated priors will be discussed in Section 2.3. 

2 . 2 . Potential response distributions. One of the great advantages of 
structured additive distributional regression is the wide range of distribution 
types that can be modelled. Since labour income is by definition positive, 
we will restrict ourselves to four nonnegative distributions summarised in 
Table 1. For a more comprehensive list of distributions supported by the 
distributional regression framework, see Section B.1.1. 

As noted, the standard conditional distribution type in econometric in¬ 
come analyses is the log-normal distribution. Next to its theoretical appeal 
from an economic perspective [see Arnold (2008), page 122 ], it has the ad¬ 
vantage that it makes the vast statistical inference machinery built around 
Gaussian regression available to researchers. However, Atkinson (1975) and 
others have noted that, at least for the aggregate income distribution, the 
log-normal distribution ht is problematic at times, especially for the upper 
tail of the distribution. 
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Table 1 

Selected candidate distributions; for a more comprehensive list see Table B1 


Name 


Density 


Parameters 


Response functions 


Log-normal 
Inverse Gaussian 


Dagum 


Gamma 



fi,<7>0 (t]) — {r]) — e->cp{r)) 

a,b,c > 0 — h^{r)) — h^{r]) — exp(77) 


iu) =exp{ri) 

h>^(r)) = h'’^(rj) = exp(r)) 


Partly as a consequence, various other distribution types have thus been 
suggested for the modelling of income distributions. Salem and Mount (1974) 
proposed the gamma distribution as a suitable alternative to the log-normal 
distribution. One of its advantages is that its estimation is possible within 
the framework of generalised linear models as the distribution belongs to the 
exponential family (as long as covariate effects are restricted to the mean). 

The third distribution we consider also belongs to the exponential family 
(if the second parameter is assumed to be independent of covariates). The 
inverse Gaussian distribution has to our knowledge not been used in the 
context of modelling income distributions yet. But for other nonnegative 
distributions with a similar economic rationale, like the distribution of claim 
sizes arising in car insurance [Heller, Stasinopoulos and Rigby (2006), Klein 
et al. (2014)], it has shown to perform well due to its flexibility in modelling 
extreme right skewness. As it is conceivable that some conditional income 
distributions also portray such extreme skewness, we decided to also consider 
this distribution type. 

The last distribution we consider is the Dagum distribution [Dagum (1977)] 
which belongs to the beta-type size distributions that have seen considerable 
attention in the literature on modelling (aggregate) income distributions [see 
Kleiber and Kotz (2003)]. One of its appealing properties is that towards 
the upper end of the distribution its shape mirrors the one of the Pareto 
distribution which is generally assumed to provide a good approximation 
for the income distribution for the top percentiles of the (aggregate) income 
distribution [Piketty and Saez (2007)]. 

2.3. Structured additive predictors and associated priors. 

Generic representation. While considering a specific instance of a struc¬ 
tured additive predictor for the analysis of income, a generic structured 
additive predictor for parameter dik is given by 



( 3 ) 
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where /3o represents the overall level of the predictor and the functions 
j = 1 ,..., Jfc, relate to different covariate effects defined in terms 
of the complete covariate vector i/j. Note that each distribution parameter 
may depend on different covariates and a different number of effects J^, but 
we suppress this possibility (as well as the parameter index) in the following. 

In structured additive regression, each function fj is approximated by a 
linear combination of Dj appropriate basis functions, that is, 

dj=l 

such that in matrix notation we can write fj = • • •, fjit^n))' = 

where Zj[i,dj] = Bj^ci.{ui) is a design matrix and Pj is the vector of coef¬ 
ficients to be estimated. To ensure identifiability specihc constraints repre¬ 
senting for example centring of the functional effects are added, see Sec¬ 
tion B.2.2 for further details. The basis function representation then leads 
to the following matrix representation of the generic predictor (3): 

(4) r] = PqI + ZiPi + ■ ■ ■ + ZjPj. 

For each of the parameter vectors Pj we can then either assume a hierarchical 
specification, where Pj is related to another structured additive predictor 
(as in the case of the spatial effect in our example), or we directly assume 
the multivariate normal prior 

/ ]^ \ (rk(Kj))/2 / ^ X 

(5) ^ exp|^-^/3'Kj/3jj 

with (potentially rank-deficient) precision matrix Kj and prior smoothing 
variance tJ. The latter is assigned an inverse gamma hyperprior tJ ~ 
lG{aj,bj) (with aj = bj = 0.001 as a default option) in order to obtain a 
data-driven amount of smoothness. 

A detailed discussion of terms that fit into the generic predictor framework 
(in the context of mean regression) is provided in Fahrmeir, Kneib and Lang 
(2004) and Fahrmeir et al. (2013), Chapters 8 and 9. 

In the following, we will discuss suitable specifications and prior assump¬ 
tions for the hierarchical predictor defined in (1) and (2). Note that we drop 
the dependence on the distributional parameter indicated by the superscript 
■dfc, the observation index i and the function index j to simplify notation. 
Hierarchical extensions are treated in detail in Lang et al. (2014). 

Linear effects. For all parametric, linear effects, we assume a flat, nonin- 
formative prior. This may be considered the limiting case of a multivariate 
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Gaussian prior with high dispersion which can also be used to achieve regu- 
larisation in the case of high-dimensional parameter vectors. In our analyses, 
we assume linear effects for the intercept and the educational indicator, as 
well as for the East-West indicator. 

Continuous covariates. For the effects of age and previous work expe¬ 
rience, assuming a linear effect is probably too restrictive. We therefore 
consider P(enalised)-splines [Eilers and Marx (1996)] as a ffexible device for 
including potentially nonlinear effects f{x) of a continuous covariate x. In 
a first step, /(x) is approximated by a linear combination of D B-spline 
basis functions B^ix) that are constructed from piecewise polynomials of a 
certain degree I upon an equidistant grid of knots, f{x) = J2d=i l^dBd{x). 
To avoid the requirement of choosing an optimal number of knots together 
with optimal knot positions, Eilers and Marx (1996) regularise the func¬ 
tion estimate by augmenting a difference penalty to the fit criterion. In onr 
Bayesian framework, the stochastic analogue is to assume a first or second 
order random walk 

Pd Pd—1 T ^ 2, . . . , Zl, 

Pd = ‘^Pd-l — Pd-2 + £d, d = 3,...,D 

with Gaussian errors Sd ~ N(0,r^) and noninformative priors for /3i or /3i 
and P 2 [Lang and Brezger (2004)]. The joint prior of all basis coefficients 
(3 = (/?!,..., Pd)' can then be shown to be a (partially improper) multivariate 
Gaussian distribution with zero mean and precision matrix K = D'D, where 
D is a difference matrix of appropriate order. In our analysis, we nse twenty 
inner knots, a cubic spline basis and a second order random walk prior as 
the defanlt specihcation for penalised splines. 

In the case of the age effect, we allow for separate functions for individuals 
with high and low levels of education. This is achieved by the inclusion of the 
varying coefficient term [Hastie and Tibshirani (1993)] f 2 {age) such that the 
age effect is given by fi{age) — f 2 {age) for individuals with low educational 
level and fi{age) + f 2 {age) for individuals with high educational level. In 
this case, a penalised spline can be assumed for function f 2 {age) as well. 

Random effects. Penalised splines can in principle also be considered to 
represent the temporal effect ftimeft) in (!)• However, since in economic re¬ 
search temporal effects such as ours are generally considered by year-specific 
effects, we do not impose the smoothness assumption implied by penalised 
splines. We therefore consider a random effects specification where separate 
regression effects Pt = are assumed for the distinct time points. An 

i.i.d. Gaussian prior with random effects variance is then placed on the 
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coefficients (3 = (/3i,...,Similarly, random effects priors can be used 
for any other grouping variable with levels {1 ,..., G} present in the data. 

Note that we have not included individual-specific random effects. The 
reason for this is that we are specifically interested in the unobserved het¬ 
erogeneity among individuals with similar covariate sets which finds expres¬ 
sion in income inequality among them. In some sense our analysis is thus 
systematically different from standard regression techniques which pursue 
to eradicate the stochastic component or at least reduce it to a minimum. 
The inclusion of individual-specific effects goes a long way towards seem¬ 
ingly achieving this aim, as the share of the variance left to the error term is 
drastically reduced. However, the inferential gain obtained thereby could be 
expressed as follows: including individual-specific effects, we have found that 
incomes are largely different because individuals are different. While there 
are some analyses where such eradication of variance is useful, it sheds little 
insights on the nature of inequality at the disaggregated level since we are 
unable to disentangle the differences between individuals in a meaningful 
way. 

Spatial effects. For the spatial effect /spatl^) defined upon the discrete, 
spatial variable s G {1,..., 5} which denotes the different regions in the data 
set, we assume a hierarchical predictor specification following Lang et al. 
(2014). In fact, equation (2) merely defines a second structured additive 
predictor where now the distinct spatial regions define the unit of observa¬ 
tion. As a consequence, any type of regression effect that is specific for the 
region can be included on this level. In our case, the East-West indicator is 
one such example that is assigned a parametric effect with flat prior. 

In addition, we consider the spatially structured and spatially unstruc¬ 
tured effects fl'str(s) and 5'unstr('S), respectively. In both cases, separate regres¬ 
sion effects /3str,s = fi'str(s) and /3unstr,s = fl'unstr('S) are assumed for each of the 
regions, but the effects differ in terms of their prior assumptions. For the 
structured spatial effect, we assume spatial correlations defined implicitly by 
assuming a Gaussian Markov random field prior [Rue and Held (2005)] for a 
suitable neighbourhood structure derived from the spatial orientation of the 
data. The most common case would be to treat two regions as neighbours 
if they share a common boundary. If ds denotes the set of all neighbours of 
region s, the Markov random field prior then assumes 



( 6 ) 


with number of neighbours of region s denoted as Ng. Consequently, the 
conditional mean of /Istr,s given all other coefficients is the average of the 
neighbouring regions. It can be shown that the conditional normal distribu¬ 
tions specified in (6) correspond to a multivariate, partially improper normal 
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distribution with zero mean and precision matrix given by the adjacency 
matrix induced by the neighbourhood structure. 

For the unstructured spatial effect, we consider an i.i.d. Gaussian prior, 
that is, we assume a random effects prior specification. The rationale for 
considering both a structured and unstructured part of the spatial effect 
is that they are surrogates for unobserved spatial heterogeneity which may 
either be spatially structured (i.e., spatially smooth) or unstructured. 

2.4. Bayesian inference. To perform Bayesian inference, we consider 
Markov chain Monte Carlo (MCMC) simulation techniques and develop suit¬ 
able proposal densities based on iteratively weighted least squares (IWLS) 
approximations to the full conditionals. The derivation of the approxima¬ 
tions and the complete algorithm are documented in Section B.2. Here, we 
only sketch the essential parts. 

IWLS proposals for regression coefficients. The regression coefficients /3j 
are proposed from N(/.ij,P~^) with expectation and precision matrix 

A., = PT^Z'W(z - P, = Z;.WZ, + 4 k,-, 

where W is a diagonal matrix of working weights Wi =K{—d‘^l/drj‘f), z = 
T] + (W)“^v is a working response depending on the score vector v = dl/dr] 
and ri_j = r] — is the predictor without the jth component. The work¬ 
ing weights and the score vector are specific for the chosen response distribu¬ 
tion and induce an automatic adaptation to the form of the full conditional 
without requiring manual tuning. 

Updates for the smoothing variances. The smoothing variances tJ can be 
sampled in a Gibbs update where ~ with updated parameters 

Working weights. The specification of the working weights W involves 
the expectations of the negative second derivatives of the log-likelihood 
which improved both mixing and acceptance rate in comparison with the 
(seemingly simpler) approach of using the negative second derivative with¬ 
out deriving the expectation. Furthermore, invertibility of the precision ma¬ 
trix Pj is ensured for many distributions when using the expectation since 
the working weights are then nonnegative. Explicit derivations for both the 
distributions utilised for analysing labour incomes and the additional distri¬ 
butions summarised in Table B1 can be found in Section B.2.3. 
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Propriety of the posterior. Propriety of the posterior in distributional 
regression can be ensured when combining the assumptions considered in 
Klein, Kneib and Lang (2015) for count data regression with appropriate 
restrictions on the densities. These need to be bounded or integrable with 
respect to the predictors, whereby at least one observation fulfilling the 
latter assumption is required. Note that integrability of the densities can be 
assured by the assumption that none of the distributional parameters is on 
the boundary of the parameter space (an assumption that would also have 
to be made to apply standard maximum likelihood asymptotics). 

Software. Our Bayesian approach to distributional regression is imple¬ 
mented in the free, open source software BayesX [Belitz et al. (2015)]. As 
described in Lang et al. (2014), the implementation makes use of efficient 
storing mechanisms for large data sets and sparse matrix algorithms for sam¬ 
pling from multivariate Gaussian distributions. An R interface to BayesX is 
provided in the R add-on package bamlss [Umlauf et al. (2014)]. 

Empirical evaluation. We compared the empirical performance of the 
proposed Bayesian approach to the frequentist GAMLSS framework in two 
simulation scenarios and also investigated the performance of the deviance 
information criterion [DIG, Spiegelhalter et al. (2002)] for choosing between 
competing models. The studies and their outcomes are documented in more 
detail in Section B.3. A summary on the ability of the DIG for model choice 
is given in Section 3 and for the comparison with the frequentist approach 
(denoted as ML) in the following: 

(1) Comparison with ML in additive models. In purely additive models, 
the point estimates and corresponding posterior means, as well as their mean 
squared errors (MSEs), are very similar. However, coverage rates based on 
asymptotic maximum likelihood theory for ML are far too narrow in several 
distribution parameters. In particular, for the Dagum distribution, rates 
for all three parameters are far from the desired coverage level, while the 
credible intervals of the Bayesian approach are still reliable (albeit being 
usually slightly too conservative); compare, for example. Figure B2. 

(2) Comparison with ML in geoadditive models. 10% of the estimation 
runs of ML failed before convergence. MSEs of the spatial effect (based 
on a Markov random field) are slightly smaller for the Bayesian approach 
compared to ML. While the MSEs of the other effects do not deteriorate for 
our proposed method, we observe partly increasing MSEs for ML. 

3. Model choice. In any application of distributional regression, one 
faces important model choice decisions: choosing the most appropriate out 
of a set of potential response distributions and selecting adequate predictor 
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specifications for each parameter of these distributions. For our application 
on conditional income distributions, we consider the inverse Gaussian (IG), 
log-normal (LN), gamma (GA) and Dagum (DA) distribution as candidate 
distributions. A general predictor that could now be utilised for any of the 
parameters of these distributions was already introduced in equations (1) 
and (2). Instead of performing a complete stepwise model selection for each 
distribution, we study the following model specifications: 

(Ml) All distributional parameters are related to a predictor of type (1). 
For the spatial effect, we only include the unstructured effect since it turned 
out in exploratory analyses that the smooth component has only negligible 
impact. 

(M2) Instead of modelling all parameters in terms of covariates, the model 
structure of Ml is only applied to the parameters /r in the case of LN, IG 
and GA, and b in case of DA. The parameters a, c, a, cj^ are considered to be 
equal across all individuals. This corresponds to a usual GAM specification 
with focus on conditional means. 

(M3) All parameters are modelled in analogy to Ml except that the ran¬ 
dom effect for calendar time and the complete spatial effect (including the 
East-West indicator) are not included in the parameters a,c, u, 

In total, we therefore end up with 12 models to compare. In the following, 
we will discuss different options for conducting this comparison and will 
also comment on their wider applicability in the context of model choice for 
distributional regression. 

3.1. Deviance information criterion. The deviance information criterion 
(DIG) is a commonly used criterion for model choice in Bayesian infer¬ 
ence that has become quite popular due to the fact that it can easily 
be computed from the MGMC output. If is a MGMC sam¬ 

ple from the posterior for the complete parameter vector 0, the DIG is 
given by D{6) + pd = 2D{6) — D{0) = — D(^^0W), where 

D{d) = —21og(/(y|0)) is the model deviance and pd = D{d) — D{6) is an 
effective parameter count. 

The DIG can be used to discriminate between types of response distri¬ 
butions as well as different predictor specifications for a fixed distribution. 
The latter can also be implemented in a stepwise model choice strategy. 
However, since the DIG is sample-based, small differences of DIG values 
for competing models may induce a region of indecisiveness. If in such a 
situation sparser models are desired, the DIG-based selection of covariate 
effects can be assisted by only including significant effects, that is, effects 
for which the credible interval of a certain level does not contain the zero 
(parametric effects) or the zero line (nonparametric effects); compare also 
Section B.3.3.2. 
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For distributions and models considered in our applications, we conducted 
simulations on the performance of the DIG which are documented in detail in 
Section B.3.3. The basic outcome is that the DIG can discriminate between 
competing response distributions although differences can be rather small 
depending on what distributions are compared. Goncerning the identification 
of relevant covariates, we focused on spatial effects and found that the DIG 
usually is in clear favour of the true model if a relevant effect is omitted. In 
the reverse situation, that is, irrelevant information is included, the DIGs of 
the true models are only slightly smaller, but then the irrelevant covariate 
mainly yields an insignificant effect (i.e., the 95% credible interval of each 
region contains zero) and would thus be excluded under the aim of a sparser 
model. For count data distributional regression models, the performance of 
the DIG was also positively evaluated by Klein, Kneib and Lang (2015) who 
compare several misspecified models to the true model in terms of the DIG. 

The DIG values for the 12 income regression models under consideration 
are documented in Table 2 and indicate a clear preference for the model 
DA_M1. In general, it is noticeable that the DIG favours our flexible model 
specifications (Ml) compared to the simplified versions (M2, M3). 

3.2. Quantile residuals. For continuous random variables, it is a well- 
known result that the cumulative distribution function F{-) evaluated at 
the random variable yi yields a uniform distribution on [0,1]. As a conse¬ 
quence, quantile residuals defined as r* = <h“^(F(yi|'i9j)), with the inverse cu¬ 
mulative distribution function (c.d.f.) of a standard normal distribution 
and F{-\'&i) denoting c.d.f. with estimated parameters i?* = (i?ii,..., 


Table 2 

Comparison of DIG values (calculated based on the complete data set) and average scores 
obtained from ten-fold cross-validation 


Distribution 

DIG 

Quadratic score 

Logarithmic score 

Spherical score 

CRPS 

LN_M1 

179,090 

0.130 

-2.436 

0.362 

-2.158 

LN_M2 

180,533 

0.126 

-2.460 

0.357 

-2.141 

LN_M3 

179,451 

0.130 

-2.435 

0.362 

-2.163 

IG.M1 

184,614 

0.146 

-2.274 

0.378 

-1.620 

IGJV[2 

189,702 

0.138 

-2.314 

0.366 

-1.677 

IGJViS 

186,494 

0.144 

-2.282 

0.374 

-1.642 

GA_M1 

177,453 

0.161 

-2.172 

0.396 

-1.274 

GA_M2 

178,736 

0.156 

-2.181 

0.392 

-1.279 

GA_M3 

177,971 

0.160 

-2.174 

0.395 

-1.277 

DA_M1 

172,421 

0.168 

-2.103 

0.405 

-1.266 

DA_M2 

173,791 

0.164 

-2.120 

0.402 

-1.274 

DA_M3 

172,790 

0.167 

-2.108 

0.404 

-1.270 
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plugged in, should at least approximately be standard normally distributed 
if the correct model has been specified [Dunn and Smyth (1996)]. In practice, 
the residuals can be assessed graphically in terms of quantile-quantile-plots: 
the closer the residuals are to the bisecting line, the better the fit to the 
data. We suggest to use quantile residuals as an effective tool for deciding 
between different distributional options where strong deviations from the 
bisecting line allow us to sort out distributions that do not fit the data well. 

Quantile residuals are closely related to the probability integral transform 
(PIT) which considers Ui = F{yi\^i) without applying the inverse standard 
normal c.d.f. If the estimated model is a good approximation to the true 
data generating process, the Ui will then approximately follow a uniform 
distribution on [0,1]. As a graphical device, histograms of the Ui are then 
typically considered. 

Quantile residual plots for the models of type Ml are shown in Figure 2. 
Similar outcomes for model types M2/M3 and PITs for the models Ml can 
be found in Figures Al and A2, respectively. We prefer quantile residuals in 
the quantile-quantile-plot representation since they avoid the requirement 
to define breakpoints in the construction of the histogram. 

While none of the distributions provides a perfect fit for the data, the 
Dagum distribution turns out to be most appropriate for residuals in the 
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Eig. 2. Comparison of quantile residuals for the full models DA.Ml (topleft), LN.Ml 
(topright), IG- Ml (bottomleft), GA.Ml (bottomright). 
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range between —2 and 2 but deviates from the diagonal line for extreme 
residuals. In contrast, the log-normal and inverse Gaussian distribution seem 
to have problems in capturing the overall shape of the income distribution, 
resulting in sigmoidal deviations from the diagonal. Residuals of the gamma 
model are reasonable in the range between —2 and 2 (similar to the Dagum 
distribution) but deviate more strongly from the diagonal for extreme resid¬ 
uals. 

3.3. Proper scoring rules. Gneiting and Raftery (2007) propose proper 
scoring rules as summary measures for the evaluation of probabilistic fore¬ 
casts, that is, to evaluate the predictive ability of a statistical model. We 
consider three common scores, namely, the Brier or quadratic score (QS), 
the logarithmic score (LS) and the spherical score (SPS). For continuous re¬ 
sponse distributions with density Pr{y) = p{y\'&ri, ■ ■ ■ ,'&rK) and a given new 
realisation ynew, these are dehned as 

LS(pf,ynew) — log(Pr (l/new))) 

QS(pr;2/new) — new) j\Priy)\‘^ dy. 

Appropriate definitions for discrete as well as mixed discrete continuous 
responses are provided in Section B.1.2. As a fourth alternative, we consider 
the continuous ranked probability score (CRPS) 

/ OO 

(Friy) - 

-OO 

where Fr is the cumulative distribution function corresponding to the density 
Pr [Gneiting and Ranjan (2011)]. Laio and Tamea (2007) showed that the 
CRPS score can also be written as 

CRPS(pr ) J/new) = (a)} ~(Q^) ~ J/new) 

^ 0 

where F“^(a) is the quantile function of pr evaluated at the quantile level 
a€ (0,1). This formulation allows not only to look at the sum of all score 
contributions (i.e., the whole integral) but also to perform a quantile decom¬ 
position and to plot the mean quantile scores versus a in order to compare 
fits of specific quantiles [Gneiting and Ranjan (2011)]. This decomposition 
is especially helpful in situations where the quantile score can be interpreted 
as an economically relevant loss function [Gneiting (2011b)]. 

In practice, we obtain the probabilistic forecasts in terms of predictive 
distributions pr for observations yr by cross-validation, that is, the data 
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set is divided into subsets of approximately equal size and predictions for 
one of the subsets are obtained from estimates based on all the remain¬ 
ing subsets. Let be data in a hold-out sample and pr the pre¬ 
dictive distributions with predicted parameter vectors 1?^ = ("dri, • ■ • ■, 

r = 1,... ,R. Competing forecasts are then ranked by averaged scores S = 
S{pr-, yr) such that higher scores deliver better probabilistic forecasts 
when comparing different models. 

In our application, we conducted ten-fold cross-validation; observations 
are assigned randomly to the different folds. The scores discussed above are 
documented in Table 2 where the values are averages of the ten folds (and 
scores within the folds are themselves averages over the individual score 
contributions). In line with the DIG and the residual plots, the scores of the 
DA_MI model are the highest and thus deliver the best forecast among the 
12 models under consideration. Also similar to the DIG, models of type M2 
(the simplest versions) show lower scores compared to the ones of type M3 
and they themselves are inferior compared to the most flexible models of 
type Ml. 

In addition to the averages over the ten folds, the proper scoring rules 
can also be used to assess the predictive distributions in more detail. We 
illustrate this along a decomposition of the GRPS over quantile levels (Fig¬ 
ure 3) and a decomposition of the scores over the cross-validation folds; 
compare the supplement Section A.l. The quantile level decomposition of 
the GRPS again indicates a comparable performance of the Dagum distri¬ 
bution and the gamma distribution as compared to the inverse Gaussian 
distribution which performs somewhat worse and the log-normal distribu- 


Quantile decomposition 



Eig. 3. Quantile decomposition of CRPS in the full models DA^Ml, LN^Ml, IG-Ml, 
GA.M1. 
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tion which shows a considerably deteriorated behaviour. This ordering holds 
true over the complete range of quantiles. The fact that the log-normal dis¬ 
tribution fails to provide a competing predictive ability is most probably 
related to the strong impact of the extreme observations. These are hard to 
capture by the log-normal distribution in general. However, since extreme 
observations are typically also influential observations, they seem to impact 
estimates in the log-normal model to such an extent that even predictions 
for the central part of the distribution are affected negatively. 

4. Regional disparities of the distribution of labour income in Germany. 

As discussed in the Introduction, our main focus is on investigating differ¬ 
ences in conditional income distributions between former East and West 
Germany in the first decade of the new millennium. More specifically, we 
focus on differences in the inequality of the conditional income distribution 
as measured by the Gini coefficient [Silber (1999)] next to significant differ¬ 
ences in the first two moments of conditional income distributions. Based 
on our model choice, we illustrate the estimation results along the Dagum 
model DA_M1. 

In their seminal paper, DiNardo, Fortin and Lemieux (1996), stress the 
need to look at differences between the whole conditional income distribu¬ 
tions rather than just the conditional mean income, or certain indices. Using 
our proposed estimation procedure, this is feasible. Figure 1 displays an ex¬ 
emplary contrast of four conditional income distributions in a ceteris paribus 
type analysis. The four distributions have all but two covariates fixed at their 
average value. For age (42 years) and labour market experience (19 years) 
we use the arithmetic mean of the observations in our sample, while we fixed 
the random effects at their prior expectation, that is, at zero. Keeping these 
covariates fixed, we can observe the nature of the change if the regional vari¬ 
able is changed from East to West. For both educational levels, this hgure 
furthermore indicates that there is a noticeable difference not only in the 
mean value of the distributions but also in other aspects, like variability, 
skewness, etc. Thus, a simple analysis of means falls short of portraying a 
comprehensive picture of the differences in income between East and West. 

Note that for determining the densities displayed in Figure 1 we con¬ 
sider the posterior mean of the densities obtained in the different MGMC 
iterations instead of plugging in the posterior mean parameters in the cor¬ 
responding parametric densities. The availability of such posterior mean 
estimates is another advantage of the Bayesian inferential approach based 
on MGMC simulations. 

There are various additional aspects of the distribution that can be con¬ 
sidered. In principle, it is possible to obtain any distributional measure from 
the conditional distribution as long as it is defined for the given distribu¬ 
tion type and the corresponding parameter set. Here, we consider the mean. 
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the standard deviation and the Gini coefficient of the estimated conditional 
income distributions. While the mean provides important information on 
the location of the income distribution, the standard deviation provides in¬ 
formation on the scale of the distribution and the Gini coefficient is the 
most frequently used scalar measure on income inequality [Silber (1999)]. 
We therefore look at three important aspects of the conditional income dis¬ 
tributions and observe how they change over the covariate space. 

4.1. The spatial effect on conditional means and standard deviations. As¬ 
suming a Dagum distribution, the first two moments of the conditional in¬ 
come distributions of i/i can be found in Dagum (2008), respectively. Figure 4 
displays the posterior mean estimates for the expected incomes for each of 
the 96 regions {Raumordnungsregionen) and education. As described above, 
the other covariates are fixed at their mean. 

Unsurprisingly, there is a clearly visible divide between East and West 
Germany, as expected incomes are higher in the former Federal Republic 
of Germany for both education levels and at the average of the other co¬ 
variates. Abstracting from the variations at the district level, we get an 
expected income of 33,600 if the average man lives in the East and has no 
higher education. With higher education the income increases to 55,200. 

Without Higher Education With Higher Education 




■ J 

20,000€2oio 100,000€2oio 


■ J 

20,000^2010 100,000^2010 


Eig. 4. SOEP data. Posterior means for the expected incomes for 48-year-old males with 
19 years of working experience. Left: males without higher education. Right: males with 
higher education. 
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The corresponding values if a person with the same attributes lives in the 
West are 48,100 and 78,300. The difference between East and West is thus 
14,500 (12,000; 17,100) and 23,100 (19,000; 27,400) without and with higher 
education, respectively, where the numbers in the brackets denote the cor¬ 
responding 95% credible intervals. In addition to posterior means, we also 
looked at posterior medians. Overall, differences were negligible, which is 
in line with the theory suggesting asymptotic normality for the posterior 
distribution. 

The posterior mean estimates for the standard deviations of the condi¬ 
tional income distributions are shown in Figure 5. We prefer presenting the 
square roots of the second moments, that is, we consider the standard devi¬ 
ations rather than the variances for interpretability reasons. 

For standard deviations, the division between East and West is not as 
distinct as for the means. The main difference in the scale of the conditional 
distributions is found between the education levels and not along the dif¬ 
ferent regions or former two parts of Germany. Nonetheless, if we set the 
spatial random effect to zero again and only consider the structural effect, 
the resultant conditional distribution in the West has a standard deviation 
of 19,300, while that of the East has a standard deviation of 16,000 for 
those without higher education. For those with higher education the respec¬ 
tive numbers are 32,000 and 26,600. The difference between the standard 

Without Higher Education With Higher Education 
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Fig. 5. SOEP data. Posterior means for standard deviations for 42-year-old males with 
19 years of working experience. Left: males without higher education. Right: males with 
higher education. 
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deviations is thus 3300 (1300; 5200) in the group of lower educated males 
and 5400 (1700; 9100) for the one with higher educated males. 

Our results show that evaluated at the mean of other covariates, the first 
and second moment are significantly different in East and West Germany 
for both education levels, highlighting the diverse nature of the change of 
conditional income distributions. 

4.2. The spatial effect on the conditional income inequality. The Gini co¬ 
efficient is an inequality measure based on the Lorenz curve [Sarabia (2008)], 
which can vary between the value 0 (everybody has the same) and 1 (one 
person has everything). Note that the Gini coefficient is scale invariant such 
that in standard mean regression on log-incomes it would be postulated 
as constant across the covariate space. In analogy to the conditional mean 
income and standard deviation, the Gini coefficient of the conditional in¬ 
come distribution can easily be obtained from the parameter estimates of 
the Dagum distribution [Dagum (2008), page 104]. 

Figure 6 portrays the posterior mean estimates for the Gini coefficients 
for each region. As we can see, the differences are not as clear cut as for the 
conditional mean incomes. Nonetheless, the pattern emerging indicates that 
income inequality among 42-year-old males with 19 years of experience is 


Without Higher Education 


With Higher Education 
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Eig. 6. SOEP data. Posterior means for the Gini coefficients for 4Z-year-old males with 
19 years of working experience. Left: males without higher edueation. Right: males with 
higher edueation. 
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higher in the East for both education levels. Indeed, if we only consider the 
impact of the binary East-West variable on the Gini coefficient, we obtain a 
difference of the posterior means of 0.039 and 0.036 for those without higher 
education and those with higher education, respectively. The correspond¬ 
ing 95% credible intervals are [0.015,0.067] and [0.013,0.063], respectively. 
Thus, we have a significantly larger income inequality for 42-year-old males 
with 19 years of experience, as measured by the Gini coefficient, in the East 
than in the West. Putting these differences into perspective, the standard 
deviations of the Gini coefficients of the regions’ conditional income distri¬ 
butions within East and West are 0.030 and 0.031 for those without higher 
education, and 0.032 and 0.031 for those with higher education. Thus, the 
differences between East and West are not only significant, they also surpass 
their variation within East and West. 

4.3. Further analysis of the conditional income distribution. 

The effect of varying age and experience. Next to spatial effects, the 
impact of the other covariates can also be of interest. In the following, we 
focus on the effects of age and experience, while the effect of year is treated 
in Section A.1.1. Eor results on additional covariate sets, see Section A.1.2. 

In Figure 7, we display the expected conditional mean income and the 
Gini coefficient with respect to age and experience. In order to keep the 
dimension of the varying covariate to one, we simply assume that from the 
age of 21 onwards people gain one year of work experience as they grow 
older by one year. Here, we thus portray the development of expected in¬ 
comes and the Gini coefficient for full-time working males who have been 



Fig. 7. SOEP data. Posterior means for expected income (left) and Gini coefficients 
(right) for males who have been working since the age of 21, without higher education and 
living in the West, together with 95% simultaneous credible bands. 
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working since the age of 21. With regard to the categorical variables region 
and education, we consider only the West and lower education, respectively. 
The random effects for the Raumordnungsregion and year are considered at 
zero, that is, their prior expectation. The grey lines indicate the 95% simul¬ 
taneous credible bands. As expected, there is a general upward trend such 
that expected incomes are rising with increasing age. In addition, we see the 
concave structure that is generally also found by the literature. 

For the Gini coefficient, we observe a U-shaped development over age. 
This indicates that the conditional income distribution is not simply rescaled 
over the age range but rather that it changes its shape such that the Gini 
coefficient rises. Our results are again in line with economic theory. At the 
very beginning of the career, income inequalities should be rather high, as 
large parts are still not yet allocated in accordance to their capabilities 
and, consequently, are employed and paid more or less arbitrarily. These 
mismatch-induced inequalities quickly fade away. From then on we would 
expect rising inequality, as following the classical theories on the shape of 
the unconditional income distribution [Arnold (2008)]; the latter is made up 
of incomes derived from a varying number of autoregressive permutations. 
These permutations, which would generally occur over the age range under 
consideration, would lead to a rising inequality in incomes with rising age. 

Other quantities derived from eonditional ineome distributions. Using 
distributional regression, it is easily possible to obtain estimates for certain 
quantiles, like the median, which is an alternative to the mean as a location 
measure. Furthermore, one can calculate interquantile ranges as an alter¬ 
native measure of inequality. Naturally, such quantiles can be estimated in 
a more direct manner using quantile regression, although additional efforts 
may be required to avoid crossing quantile curves, in particular when consid¬ 
ering a dense set of quantile levels. Distributional regression automatically 
avoids the problem of quantile crossing and makes model comparison eas¬ 
ier in such situations. We contrast distributional regression against quantile 
regression in more detail for our case study in Section A.1.3. 

Next to measures of inequality like the Gini coefficient or the Theil index, 
which are easily computable, it is also straightforward to calculate measures 
of polarisation, which have recently received considerable attention in the 
literature [for further references and explanations, see, e.g., Wolfson (1994), 
Duclos, Esteban and Ray (2004)]. Following Gradfn (2000), it would be 
possible to calculate the polarisation between two groups as defined by sets 
of covariates. 

It is also possible to assess density differences at different income lev¬ 
els or probability mass differences for different income ranges. For instance, 
one could consider the probability mass above a certain income, for exam¬ 
ple, 48,000, which according to John Keynes would suffice to turn one’s mind 
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away from pecuniary worries [Skidelsky (2010)]. Consequently, it could be 
highlighted that not only the conditional mean income for the average man 
without higher education is lower in the East but also that the probability 
mass of incomes below that threshold is much lower. Such an analysis may 
be of particular interest for research questions on poverty and vulnerability 
[Pudney (1999)]. 

4.4. Economic consequences. Our findings show that keeping other vari¬ 
ables fixed at their average level, there are significant differences in income 
inequalities within East and West Germany. Duclos, Esteban and Ray (2004) 
have noted the importance of within-group inequality for levels of alienation 
and identification within society. The higher income inequality in the East 
would thereby induce a weakened in-group identity. Lack of in-group identity 
in turn is likely to cause feelings of isolation and mistrust [Misztal (2013)], 
and thus leads to a deterioration of well-being which is beyond that cap¬ 
tured by solely considering average incomes, or even distribution-adjusted 
well-being measures [Klasen (2008)]. 

While a profound analysis of the effect of different income distributions 
to well-being must be left for further research, our application shows that 
structured additive distributional regression offers a methodology to the 
analysis of income inequality which goes beyond the analysis at a highly 
aggregated level and thus allows to start the assessment of this important 
issue at a microeconomic level. 

5. Conclusion. Distributional regression and the closely related class of 
GAMLSS provide a flexible, comprehensive toolbox for solving complex re¬ 
gression problems with potentially nonstandard response types. They are 
therefore useful to overcome the limitations of common mean regression 
models and to enable a proper, realistic assessment of regression relation¬ 
ships. In this paper, we provided a Bayesian approach to distributional 
regression and described solutions for the most important applied prob¬ 
lems, including the selection of a suitable predictor specification and the 
most appropriate response distribution. Based on efficient MCMC simula¬ 
tion techniques, we developed a generic framework for inference in Bayesian 
structured additive distributional regression relying on distribution-specific 
iteratively weighted least squares proposals as a core feature of the algo¬ 
rithms. 

Concerning the specific application of distributional regression to condi¬ 
tional income distributions, there are significant differences between men 
with similar age, work experience and education levels between East and 
West which go beyond the mean income. Taking the Gini coefficient as an 
indicator for inequality, income inequality among these men is larger in the 
East than it is in the West, further deepening differences in well-being. While 
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this study highlights the scope of the new methodology to an application of 
income analysis and beyond, much work remains to be done on the applica¬ 
tion of distributional regression techniques. 

Despite the practical solutions outlined in this paper, model choice and 
variable selection remain relatively tedious and more automatic procedures 
would be highly desirable. Suitable approaches may be in the spirit of Belitz 
and Lang (2008) in a frequentist setting or based on spike and slab priors 
for Bayesian inference as developed in Scheipl, Fahrmeir and Kneib (2012) 
for mean regression. 

It will also be of interest to extend the distributional regression approach 
to the multivariate setting. For example, in the case of multivariate Gaus¬ 
sian responses, covariate effects on the correlation parameter may be very 
interesting in specific applications. Similarly, multivariate extensions of beta 
regression lead to Dirichlet distributed responses representing multiple per¬ 
centages that sum up to one; see Klein et al. (2015a) for a first attempt in 
this direction. 

In the context of economic applications, it should be noted that, anal¬ 
ogously to generalised linear models, the additive impact of explanatory 
variables on the economic measure of interest, like the Gini coefficient, is 
generally not attained. Gonsequently, the size, and possibly also the direc¬ 
tion of the estimated spatial effect, may well be very different for different 
points in the covariate space. While it is straightforward to calculate these 
differences with corresponding credible intervals for any desired combination 
of other covariates to give a more comprehensive assessment of differences 
in inequality, further work needs to be done to facilitate the interpretation 
of results. 

In addition, in-depth-testing is required to find adequate parametric forms 
for conditional income distributions, as the application of structured additive 
distributional regression crucially rests on the assumption that the paramet¬ 
ric distribution fits the data. While for the case of full-time working men 
the Dagum distribution indeed seems to provide a decent fit, further work 
must be done to allow for an analysis with a less restricted covariate space 
and thus a more comprehensive analysis of income distributions in Germany 
and beyond. 

Yet, this paper demonstrates that structured additive distributional re¬ 
gression offers a statistical framework addressing the challenge to assess 
entire conditional distributions [Fortin, Lemieux and Firpo (2011), page 56] 
by broadening the class of potential response distributions beyond simple 
exponential families and thus offers additional scope for applied statistical 
analyses on the problem of income inequality and beyond. 
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SUPPLEMENTARY MATERIAL 

Supplement A: Case studies (DOI: 10.1214/15-AOAS823SUPPA; .pdf). 
Additional material on the application to regional income inequality in Ger¬ 
many is provided in Section A.l. A second case study on the proportion of 
farm outputs achieved by cereals is treated in Section A.2. 

Supplement B: Methodology (DOI: 10.1214/15-AOAS823SUPPB; .pdf). 
This supplement comprises details on Bayesian inference, derivations of re¬ 
quired quantities for the iteratively weighted least squares proposals and 
simulation studies. 
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